unix-style text search question/help 

I want to begin using command line utilities to query a directory of markdown files for things like: a) the number of instances of a given string, b) the number of files with one or more instances of a given string, c) a # of chars after a given string in a given file, among others. I know sed/awk/grep may be a place to start but I'm not sure how to narrow that set of features to the above use case. Any help/direction on some useful concepts or tools?

@inscript everything there could be done with just grep, and wc (wordcount). give me a moment and i’ll give you
1. the answer
2. how i arrive at the answer

@zens that would be great

@inscript so the first one I kinda know off the top of my head, you just want the count of a given string.

grep -R givenstring directoryname

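this prints one line for each line that contains a match

grep -R givenstring directoryname | wc -l

this counts those lines. note that a line containing the string twice is only counted once; adding -o to grep makes it print each match on its own line, so every instance gets counted.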

the tricky thing here is a number of utilities accept either -R or -r as the option to enable a "recursive" mode, but it's hard to remember which for any given utility.
also, search string first, always.
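
As a rough sketch of the difference between counting matching lines and counting every instance, here is the same search run both ways. The temporary directory and file contents are made up for illustration, and -o is an extension found in GNU and BSD grep rather than something every grep is guaranteed to have.

```shell
# Sample notes in a throwaway directory (hypothetical content).
notes=$(mktemp -d)
printf 'mustache spec\nmustache and more mustache\nno match here\n' > "$notes/a.md"
printf 'one mustache\n' > "$notes/b.md"

# Lines that contain the string at least once.
matching_lines=$(grep -R mustache "$notes" | wc -l)

# Every occurrence: -o prints each match on its own line.
occurrences=$(grep -R -o mustache "$notes" | wc -l)

echo "matching lines: $matching_lines, occurrences: $occurrences"
rm -r "$notes"
```

The second line of a.md holds two matches, which is why the two numbers differ.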

@inscript for the second one, I consulted grep's 'man page'.
typed
man grep

these aren't always the most readable or sensible documentation pages, but they're worth checking out before hitting the net.

in here I found a grep option "-l"
this prints only the name of each file that contains a match, and stops searching a file as soon as it finds one match.

just add it to grep in the previous command, and you get "file count"
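
Spelled out, the file count might look like the sketch below. The sample files are invented for the example; only the grep and wc options come from the thread.

```shell
# Three hypothetical notes, two of which mention the search string.
notes=$(mktemp -d)
printf 'mustache spec\nmustache again\n' > "$notes/a.md"
printf 'one mustache\n' > "$notes/b.md"
printf 'nothing relevant\n' > "$notes/c.md"

# -l prints each matching file name once, so wc -l counts files, not lines.
file_count=$(grep -R -l mustache "$notes" | wc -l)

echo "files with a match: $file_count"
rm -r "$notes"
```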

@inscript it's worth noting that the input for grep is a regex, not just a plain text string. my examples will work for simple searches, but if you intend to parameterise this as part of a program, where you don't control what goes into the search, it might be worth adding the "-F" switch, which changes the interpretation to searching for a plain string, not a regex
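
A small sketch of why this matters, assuming a search term that happens to contain a regex metacharacter. grep's -F option treats the pattern as a fixed string, and -e marks the next argument as the pattern even if it starts with a dash. The variable name user_input and the file contents are hypothetical.

```shell
notes=$(mktemp -d)
printf 'version 1.5 released\nversion 175 is unrelated\n' > "$notes/a.md"

# Imagine this value comes from somewhere you do not control.
user_input='1.5'

# As a regex, "." matches any character, so "1.5" also matches "175".
regex_hits=$(grep -R -e "$user_input" "$notes" | wc -l)

# With -F the pattern is a literal string, so only "1.5" matches.
fixed_hits=$(grep -R -F -e "$user_input" "$notes" | wc -l)

echo "regex: $regex_hits, fixed string: $fixed_hits"
rm -r "$notes"
```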

@inscript for the third one, I've been scanning the man page for its options around "context".
by default, grep does print some context around the matched word. the first example prints filename, and the full line that the word appears in.

I couldn't find anything that trims it to just a few characters, so we might indeed need to use sed.
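
One option that may get close without leaving grep is -o, which prints only the matched text. This is an alternative sketch, not the approach the thread goes on to use: the pattern is widened to include up to four following characters, and sed then strips the search word itself. It assumes a grep that supports -o and -h (GNU and BSD grep do), and the sample file is made up.

```shell
notes=$(mktemp -d)
printf '* aim for 100%% mustache spec\n' > "$notes/meme.markdown"

# -o prints only the matched text, -h drops the file name prefix,
# and -E allows the {0,4} repetition syntax without backslashes.
# sed then removes the search word, leaving up to 4 trailing characters.
after=$(grep -R -h -o -E 'mustache.{0,4}' "$notes" | sed -e 's/^mustache//')

# Brackets make the leading space visible.
echo "[$after]"
rm -r "$notes"
```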

@inscript so on my own machine when i enter the command
grep -R mustache memenotes

one of the lines it prints looks like this:

memenotes/meme.markdown:* aim for 100% mustache spec

suppose we wanted it to only print

spec

for that line.

How familiar are you with regexes?

@zens very little, but I can read up on them

@inscript regexes have a well deserved reputation for being a write only language. With a bit of familiarity, and some memorisation, they're sort of easy to write, but hard to read.

what I find most challenging about them though, in this context, is that unix tools each have their own mutually incompatible dialect of regex. that's why I tend to use perl instead of sed, since the rest of the world has tended towards making their regexes compatible with perl

@inscript for the simple stuff I do, perl looks mostly the same as sed, except its regexes work the same as everywhere else, while sed's do not.

@inscript the basic pattern for a sed substitution is

sed -e "s/pattern/substitution/"

so if you feed in a line like this

a pattern is you

the output would be

a substitution is you

I substitute "sed -e" with "perl -pe", which does the same.

@inscript so all together I get

grep -R mustache memenotes | perl -pe "s/.*mustache(.{1,4})?.*/\1/"

which prints the 1 to 4 characters that follow the word "mustache" on each matching line, including any spaces. a line where nothing follows the word comes out empty.

example output:

var

, jq
spe

can

/ h

@inscript I'll break down this part a bit more
perl -pe "s/.*mustache(.{1,4})?.*/\1/"

so the "s/" stands for the substitution command in both sed and perl. the pattern ".*" means match zero or more of any character. then comes the part of the line I want to find, mustache. then I use ( ) parens as a capture group. this groups the pattern inside so I can both refer to it in the substitution, and modify the whole group with ? meaning "maybe".

@inscript the whole pattern is wrapped in .* on either side because we don't just want to replace one thing in the middle, we want to replace the whole line, with whatever we found in the capture group. We refer to the capture group with \1 in the substitution.

and finally, the stuff inside the parenthesised capture group is just "match between 1 and 4 of any character".

*breathes out*

not exactly straightforward, and I arrived there after much fiddling

@zens Wow, thank you so much for taking the time to put this together. I fear you have opened the rabbit hole just enough for me to fall in. I had a passing curiosity in perl so this may be great opportunity to dig in. This was super helpful. Thanks again!

@inscript i just realised of course, there may be an issue with my example solution, and that's that i haven't accounted for or tested the possibility of the word i am searching for occurring multiple times on one line. so buyer beware

@zens Good to know. I'm doing some tests and it's already crazy informative/instructive for my note syntax.

@inscript Well, @zens's answers are great, and highlight that there are always multiple ways to get a job done :-)

But given that you started by asking how to choose between sed/awk/grep I thought I'd talk about that instead ...

Those three core tools all have a massive overlap. They all prefer to operate on "text" presented "one line at a time". That's how they're structured, and if that's what you have they're often perfect ...

The simplest one is grep - this tool concentrates on finding things, but doesn't do much with them. You can do grep's job with sed and awk as well, but grep is generally the most efficient way to find content within text files.

The elephant in the room is that you probably don't need efficiency on a modern computer, not the way it used to be so critical 🙂

sed's job is to edit each line that it finds. It's great at this, but you have to know whether your input files are really organised on a line basis. For example, you can ask sed to find "two words" and replace them with "three words" ... but it won't work very well (by default) if there's a line break in your text ...

awk is a much bigger tool. It'll totally do any job grep or sed will do, but it also adds the idea of setting variables and keeping track of things like running totals while it works through a file - which means awk can be asked to present summaries of things it noticed about a file, rather than simply find or edit bits from it. This means that awk is much more like a programming language than the others ...

Then moving on to Perl, which is a programming language pure and simple - but it's one that was written with the intention of making work with text files easy - the author had a lot of text files on his system to organise, change, summarise, manage ... and perl's origin is a tool to help him do that, although by now it really is a full general purpose language. But that sort-of explains why a simple perl command can achieve so much!
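
To make the awk point about variables and running totals concrete, here is a minimal sketch. The sample files are invented, and since awk does not recurse into directories by itself, the files are passed in with a shell glob.

```shell
notes=$(mktemp -d)
printf 'mustache spec\nplain line\nmore mustache\n' > "$notes/a.md"
printf 'one mustache\n' > "$notes/b.md"

# grep's job: a pattern with no action prints the matching lines.
awk '/mustache/' "$notes"/*.md

# A running total across all files, reported once at the end.
# "n + 0" makes awk print 0 instead of an empty string when nothing matched.
total=$(awk '/mustache/ { n++ } END { print n + 0 }' "$notes"/*.md)
echo "total matching lines: $total"

# Per-file tallies kept in an array indexed by the current file name.
awk '/mustache/ { count[FILENAME]++ }
     END { for (f in count) print count[f], f }' "$notes"/*.md

rm -r "$notes"
```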

@yojimbo Nice to meet you. This is a great higher-level comparison. I will definitely do some digging here. I had thought of alternatively titling the OP "I too want to be a plumber", as my intention is to learn the logic of piping simple commands together, but I'm not opposed to something robust enough to get the job done all in one go.

@inscript There are a whole load of lower-level tools as well - cut, paste, head, tail, sort, uniq ... all generally found in the GNU Coreutils package gnu.org/software/coreutils/man
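
As one illustration of how these smaller tools chain together, the sketch below ranks files by how many matching lines they contain. The sample data is made up, and it assumes file names without a ":" in them, since cut splits on the first colon of grep's output.

```shell
notes=$(mktemp -d)
printf 'mustache\nmustache\n' > "$notes/a.md"
printf 'mustache\n' > "$notes/b.md"

# cut keeps the file name before the first ":", uniq -c counts repeats
# (its input must be sorted), sort -rn ranks by count, head keeps the top entry.
top=$(grep -R mustache "$notes" | cut -d: -f1 | sort | uniq -c | sort -rn | head -n 1)

echo "most matches: $top"
rm -r "$notes"
```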

@yojimbo Very interesting. I need to start spamming man, I think.

@inscript I often build up a text transform one step at a time, using a different command for each step, until I've punished the input into the shape I want ... and then re-implement the job into a single-step invocation of something more complex.

Of course there's always the "can I do this all in sed?" "can I do this all in awk?" exercises, just for the fun of them. sed's language is unexpectedly complex and can do conditionals & jumps ... but I still wouldn't recommend them for real code!

@yojimbo so can I ask, once you've found a set of commands that achieve some end, do you always package them into a shell script, or alias? What's the way you abstract over multiple commands from different utilities besides simple piping?

@inscript I tend to put things in a shell script (and it lives in $HOME/bin) - because that way there's more scope for comments and formatting to make it clear what's happening.

The biggest problem with code isn't writing it, it's being able to understand it later :-) So any way to get documentation into something is a help ...
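
A commented script along those lines might look like the sketch below. The name count-matches, the function name and the output format are all hypothetical, and the grep options assume GNU or BSD grep.

```shell
#!/bin/sh
# count-matches (hypothetical name): summarise how often a string
# appears in a directory of notes.
# Usage: count-matches STRING DIRECTORY

count_matches() {
    string=$1
    dir=$2
    # -F: treat STRING literally; -o: one output line per occurrence.
    # The $(( )) wrapper strips any padding wc adds on some systems.
    occurrences=$(( $(grep -R -F -o -e "$string" "$dir" | wc -l) ))
    # -l: list each matching file once.
    files=$(( $(grep -R -F -l -e "$string" "$dir" | wc -l) ))
    echo "$occurrences occurrences in $files files"
}

if [ "$#" -eq 2 ]; then
    count_matches "$1" "$2"
else
    echo "usage: count-matches STRING DIRECTORY" >&2
fi
```

Saved as an executable file in $HOME/bin, and with that directory on PATH, it could be run as: count-matches mustache memenotes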

@yojimbo cool, this is plenty to chew on. thanks again for commenting. sending friendly inter-instance vibes.

@inscript I have very few aliases or functions, but my absolute favourite is one that couldn't be a shell script ...

mkcd()
{
mkdir -p "$1" && cd "$1"
}
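
The reason it couldn't be a shell script is that a script runs in its own process, so its cd is gone when it exits. A small sketch of the difference, using a throwaway directory and sh -c to stand in for a script:

```shell
mkcd()
{
    mkdir -p "$1" && cd "$1"
}

base=$(mktemp -d)
cd "$base"
start=$PWD

# Run as a separate process (what a script would be): the cd is lost on exit.
sh -c 'mkdir -p "$1" && cd "$1"' sh "$base/from-script"
after_script=$PWD

# Run as a function: the cd happens in the current shell and sticks.
mkcd "$base/from-function"
after_function=$PWD

echo "after script:   $after_script"
echo "after function: $after_function"
```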
